Pattern Differentiations and Formulations for Heterogeneous Genomic Data through Hybrid Approaches

Author

  • Arpad Kelemen
Abstract

Pattern differentiation and pattern formulation are the two main research tracks in pattern analysis of heterogeneous genomic data. In this chapter, we develop hybrid methods to tackle the major challenges of power and reproducibility in detecting dynamic differential temporal gene patterns. The significantly differentially expressed genes are selected not only by statistical significance analysis of microarrays but also from supergenes obtained by singular value decomposition, which extracts the gene components that maximize the total predictor variability. Furthermore, hybrid clustering methods are developed based on the profiles resulting from several clustering methods. We demonstrate the hybrid analysis through an application to time course gene expression data from interferon-β-1a treated multiple sclerosis patients. The resulting integrated-condensed clusters and overrepresented gene lists demonstrate that the hybrid methods can be applied successfully. The post analysis includes function analysis and pathway discovery to validate the findings of the hybrid methods.

This paper appears in the publication Advanced Data Mining Technologies in Bioinformatics, edited by Hui-Huang Hsu. © 2006, Idea Group Inc.

Introduction

Progress in mapping the human genome and developments in microarray technologies have provided a considerable amount of information for delineating the roles of genes in disease states.
Complex diseases typically involve multiple intercorrelated genetic and environmental factors that interact in a hierarchical fashion, and the clinical characteristics of diseases are determined by a network of interrelated biological traits. Microarrays therefore hold tremendous latent information, but their analysis is still a bottleneck. Pattern analysis can be useful for discovering knowledge in gene array data related to certain diseases (Neal et al., 2000; Slonim, 2002). The associations between patterns and their causes are the bricks from which the wall of biological knowledge and medical decisions is built.

Pattern differentiation and pattern formulation are the two major tracks of pattern analysis. Pattern differentiation of gene expressions is the first step in identifying potentially relevant genes in biological processes. Coordinated/temporal gene arrays are widely used for pattern formulation in order to study the common functionalities, co-regulations, and pathways that are ultimately responsible for the observed patterns. The identification of groups of genes with “similar” temporal patterns of expression is usually a critical step in the analysis of kinetic data because it provides insights into gene-gene interactions and thereby facilitates the testing and development of mechanistic models for the regulation of the underlying biological processes. These temporal pattern analyses provide clues for genes that are related in their expression through linkage in a common developmental pathway.

There are several critical challenges in pattern analysis. One, in pattern differentiation, is the notorious “large p small n” problem (West, 2000): the number of genes p far exceeds the number of samples n. The large number of irrelevant and redundant genes, with noisy and uncertain measurements, severely degrades both classification and prediction accuracy. The “large p” problem is addressed through affine transformation and feature selection.
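To make the “large p small n” difficulty concrete, the following minimal sketch (not taken from the chapter) fits a least-squares linear classifier to pure noise. The sizes and the names `n_samples` and `n_genes` are illustrative assumptions. Because there are far more genes than samples, the fit reproduces arbitrary training labels exactly, while its accuracy on fresh noise should stay near chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 2000  # hypothetical sizes with n << p

# Pure-noise "expression" matrix and arbitrary class labels (+1 / -1).
X_train = rng.standard_normal((n_samples, n_genes))
y_train = rng.choice([-1.0, 1.0], size=n_samples)

# Minimum-norm least-squares fit; with p > n it interpolates the labels.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_acc = np.mean(np.sign(X_train @ w) == y_train)

# Fresh noise with fresh labels: the fitted weights carry no real signal.
X_test = rng.standard_normal((1000, n_genes))
y_test = rng.choice([-1.0, 1.0], size=1000)
test_acc = np.mean(np.sign(X_test @ w) == y_test)

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

A perfect training score on noise is the overfitting that dimension reduction and feature selection are meant to guard against.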
Affine transformations such as principal component analysis (PCA) or singular value decomposition (SVD) have the advantage of simplicity and may remove non-discriminating and irrelevant features (i.e., genes) by extracting the eigenfeatures corresponding to the largest eigenvalues (Alter, Brown, & Botstein, 2000; Yeung & Ruzzo, 2001; Wall, Dyck, & Brettin, 2001). Yet it is very difficult to identify important individual genes with these methods, and their inherently linear nature is a prominent disadvantage.

Feature selection consists of two strategies: screening and wrappers. In screening approaches, all genes are analyzed and tested individually to see whether they have a higher expression level in one class than in the other (Baldi & Long, 2001; Tusher, Tibshirani, & Chu, 2001; Storey & Tibshirani, 2003; Hastiel et al., 2000). The disadvantage of screening processes is that they are non-invertible and can cause multiple testing and model selection problems (Westfall & Young, 1993; Benjamini & Hochberg, 2002). In wrapper methods, genes are tested not independently but as ensembles, according to their performance in the classification model (Golub et al., 1999; Khan et al., 2001; Wuju & Momiao, 2002). Since the number of feature subsets increases exponentially with the dimension of the feature space, wrappers are computationally intractable for high-dimensional gene data.
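The SVD-based extraction of eigenfeatures described above can be sketched as follows. This is a minimal illustration on synthetic data, not the chapter's supergene procedure; the function name `svd_eigenfeatures`, the row-centering step, and the simulated matrix are all assumptions made for the example.

```python
import numpy as np

def svd_eigenfeatures(expr, k):
    """Return the top-k eigengenes, per-gene scores, and variance fractions.

    expr is a genes x arrays matrix (rows: genes, columns: time points).
    """
    # Center each gene so the SVD describes variation around its mean level.
    centered = expr - expr.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Share of total variability carried by each component.
    var_fraction = s**2 / np.sum(s**2)
    eigengenes = vt[:k]             # k x arrays: dominant temporal patterns
    gene_scores = u[:, :k] * s[:k]  # genes x k: weight of each gene on them
    return eigengenes, gene_scores, var_fraction[:k]

# Synthetic data: 200 genes sharing one temporal pattern, plus small noise.
rng = np.random.default_rng(1)
time = np.linspace(0.0, 1.0, 8)
pattern = np.sin(2 * np.pi * time)
loadings = rng.standard_normal(200)
expr = np.outer(loadings, pattern) + 0.1 * rng.standard_normal((200, 8))

eigengenes, gene_scores, var_fraction = svd_eigenfeatures(expr, k=2)
print("variance fractions of the first two components:", var_fraction)
```

Genes with large absolute entries in `gene_scores[:, 0]` contribute most to the dominant pattern, which is one hedged way to read off candidate genes from a linear decomposition.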

Similar resources

Scheduling Single-Load and Multi-Load AGVs in Container Terminals

In this paper, three solutions for the scheduling problem of Single-Load and Multi-Load Automated Guided Vehicles (AGVs) in container terminals are proposed. The problem is formulated as a constraint satisfaction and optimization problem. When the capacity of the vehicles is one container, the problem is a minimum cost flow model. This model is solved by the highest performance Algorithm, i.e. Network Simple...

Efficient Agrobacterium-Mediated Transformation and Analysis of Transgenic Plants in Hybrid Black Poplar (Populus × euromericana Dode Guinier)

Black poplar (Populus × euramericana Dode Guinier) is an industrially important tree with broad applications in wood and paper, biofuel and cellulose-based industries as well as plant breeding programs and soil phytoremediation approaches. Here, we have focused on development of direct shoot regeneration and Agrobacterium-mediated transformation protocols using the in vitro internodal stem tissu...

Detection of Genetic Differences between Holstein and Iranian North-West Indigenous Hybrid Cattles using Genomic Data

Extended Abstract. Introduction and Objective: Selection to increase the frequency of new mutations useful only in some subpopulations leaves markers at the genome level. Most of these regions are related to genes and QTLs controlling significant economic traits. Material and Methods: In order to detect genetic differences between Iranian northwestern crossbred and Holstein cattle breeds,...

A Hybrid Meta-heuristic Approach to Cope with State Space Explosion in Model Checking Technique for Deadlock Freeness

Model checking is an automatic technique for software verification in which all reachable states are generated from an initial state in order to find errors and desirable patterns. In the model checking approach, the behavior and structure of the system should be modeled. A graph transformation system is a graphical formal modeling language used to specify and model the system. However, modeling of large s...

Supervised Inference and Reconstruction of Biological Networks

The vast and fast development of computational and statistical methods has increased the number of applications on reconstructing the structure of large-scale biological networks. Technical feasibility of pattern recognition algorithms and the increasing availability of data repositories provide both challenges and opportunities on reconstruction of biological networks. In this paper, I will pr...

Publication date: 2015